Guizhen Wang, Purdue University, wang1908@purdue.edu
Junghoon Chae, Purdue University, jchae@purdue.edu
Hanye Xu, Purdue University, xu193@purdue.edu
Siqiao Chen, Purdue University, chen1722@purdue.edu
William Hatton, United States Air Force Academy, C16william.hatton@usafa.edu
Mahesh Babu Gorantla, Purdue University, mgorantl@purdue.edu
Benjamin Ahlbrand, Purdue University, bahlbran@purdue.edu
Jiawei Zhang, Purdue University, zhan1486@purdue.edu
Abish Malik, Purdue University, amalik@purdue.edu
Sungahn Ko, Purdue University, ko@purdue.edu
Sherry Towers, Arizona State University, smtowers@asu.edu
David Ebert, Purdue University, ebertd@purdue.edu
Student Team: NO
Did you use data from both mini-challenges? YES
Our custom-designed system, developed for the challenge, and an R package.
We applied three clustering algorithms. Please refer to the Appendix at the bottom of this document.
Approximately how many hours were spent working on this submission in total? 100.
May we post your submission in the Visual Analytics Benchmark Repository after VAST Challenge 2015 is complete? YES
Video Download
Video: http://pixel.ecn.purdue.edu:8080/~zhan1486/VASTCHALLENGE15/MC1.wmv
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Questions
MC1.1 – Characterize the attendance at DinoFun World on this weekend. Describe up to twelve different types of groups at the park on this weekend.
a. How big is this type of group?
b. Where does this type of group like to go in the park?
c. How common is this type of group?
d. What are your other observations about this type of group?
e. What can you infer about this type of group?
f. If you were to make one improvement to the park to better meet this group’s needs, what would it be?
Limit your response to no more than 12 images and 1000 words.
The provided MC1 dataset contains so many different actors (IDs) and aspects that there are many ways to group people together. The clustering capability of our tool allowed us to highlight the following groups. We applied three different clustering algorithms to define groups from different aspects.
We group together people who prefer the same attraction and spend the majority of their time there. These results are obtained using the k-means clustering technique. The results are shown as a line graph (see images below, Left), where the x-axis shows the attractions and the y-axis shows the percentage of time that people within the cluster spend at a given attraction. The results are also shown as a node graph (Middle), where the distance between nodes indicates how similar the clusters are. The number of individuals in each cluster is shown on the right of the image.
Figure 1: Results of k-means clustering based on attraction preferences.
a. How big is this type of group?
Among the many groups with different patterns of visiting attractions, we pick the groups that spend a lot of time at the Sabre Tooth Theatre: Cluster 2 with 619 customers on Friday, Cluster 2 with 498 customers on Saturday, and Cluster 3 with 1383 customers on Sunday.
b. Where does this type of group like to go in the park?
Attraction 64 (Sabre Tooth Theatre).
c. How common is this type of group?
We find this type of group to be prevalent every day.
d. What are your other observations about this type of group?
Because a crime was discovered on Sunday, the police closed the stage, so people who would have preferred to see Scott’s stage show went to the shows presented at attraction 64 instead. This may be why the size of this group rose sharply on Sunday (from 498 on Saturday to 1383 on Sunday).
On the other days, these people do not prefer going to the stage to see Scott’s show: their check-in frequency and the time they spend at attraction 63, where Scott gives the show, are comparatively low.
e. What can you infer about this type of group?
These people like shows very much.
f. If you were to make one improvement to the park to better meet this group’s needs, what would it be?
We would recommend putting the places that host shows closer to one another.
Figure 2: Heatmap of Cluster 2 on Saturday from 8:00 to 23:30.
BASED ON K-MEANS ON ATTRACTION CATEGORIES
Figure 3: k-means results based on attraction categories for three days.
We apply our k-means clustering technique to the fraction of time each person spent in every attraction category, including the check-in time and the movement around attractions of that category. This method provides a view based on aggregated categories. The k-means method clusters people into 5 groups. The ticks on the x-axis are the attraction categories: thrill rides, kiddie rides, rides for everyone, food, restrooms, beer gardens, shopping, shows, information, exit, and entry.
Group 1: The cluster drawn with a black line on all three days. People in this cluster spend much of their time checking in at thrill rides and little time checking in at rides for everyone.
a. Group 1 is one of the two largest groups. Friday: 804, Saturday: 1796, Sunday: 1936.
b. They like thrill rides.
c. This group appears every day.
d. This group is not very interested in Scott-related activities, since they spend a very small fraction of their time in the shows attraction category.
e. This group may be young people, since they enjoy visiting thrill rides and rides for everyone.
f. Thrill rides seem to be one of the most popular attraction categories. Currently, the thrill rides are located quite far from each other. The park could place thrill rides closer together to reduce travel time.
Group 2: The cluster drawn with a green line (Friday and Sunday) or a blue line (Saturday). People in this cluster spend much of their time in the show category.
a. This group is of medium size. Friday: 734, Saturday: 1279, Sunday: 873.
b. This group enjoys both rides for everyone and attending Scott’s show.
c. This group appears every day.
d. Compared to other groups, this group is very interested in Scott’s activities. They still like other attractions, however, and also enjoy spending time on thrill rides and rides for everyone.
e. This group may be Scott’s fans, who came to the park mainly for Scott.
f. It seems that people who like shows still have some interest in thrill rides and rides for everyone. The park could schedule the shows at times when thrill rides and kiddie rides have a large volume of visitors, in order to balance the visitor volume among the three categories.
Group 3: The cluster drawn with an orange line on all three days. People in this cluster spend more time in kiddie lands.
a. This is one of the smaller groups. Friday: 321, Saturday: 1984, Sunday: 589.
b. This group shows more interest in kiddie rides, which suggests that its members bring their children along with them.
c. This group appears every day.
d. This group is not very interested in Scott’s activities.
e. This group may be families with children.
f. It seems that families with children like rides for everyone and thrill rides in addition to kiddie lands. The park should make sure that rides for everyone and thrill rides also have some facilities to accommodate children.
Next, we applied sequence-based clustering, which considers the order of check-ins.
Figure 4: Bar chart representation for sequence clustering results for three days.
Among the clusters generated by C3, we present the top 10 clusters (ranked by the number of customers in each cluster), as shown in Figure 4. This figure shows the results of grouping people based on the check-in sequence of the attraction categories that they visited. The height of each row encodes the number of people within the cluster. The color of each bin within the cluster shows an attraction category. We confirm that the groups on each day enjoy different types of attractions in different orders.
The largest group on Friday (1619 customers) visited 14 attractions, while the largest groups on Saturday (4765 customers) and Sunday (6839 customers) visited 20 and 23 attractions, respectively. In Figure 4, the group of customers who visited the largest number of attractions is easily recognizable (e.g., 41 customers visited 28 attractions on Saturday). In addition, customers on Friday visited the fewest attractions on average compared to the other days.
2. People who come to the park together and remain together throughout the entire day. In other words, they go to the same places at exactly the same time. Our tool produced clustering results based on check-in sequences in order to identify these groups. If people check in at the same locations throughout the day in the same order, we believe they are travelling together.
a. The size of this type of group ranges from 2 to 40 people. The average size is 4-6 people.
b. These groups have no set preference in the park. Since this type of group applies to almost everyone in the park, there is no clear inclination toward any specific attraction.
c. This type of group is quite common. A strong majority of IDs in the dataset come to the park with at least one other person and travel with these other people throughout the day.
d. No other observations.
e. We can infer that this type of group represents friends, families, school field trips, etc. coming to the park. Generally, people do not go to amusement parks alone. The people who fall into a group of this type arrived at the park with a set group of people on purpose and intended to enjoy the park with them at all times.
f. The park could better meet the needs of these groups by widening the paths in the park so that everyone can walk closer together. Although it is clear that these people always travelled together, slight differences in their movement totals occur, possibly because they do not change grid squares at the same time when they cannot walk side by side.
WE DO CROSS-VALIDATION BETWEEN CLUSTERING RESULTS AND MC-2 DATA HERE.
3. People who do not travel together but communicate with each other by messaging.
Figure 5: Node graph of sequence clustering.
a. This type of group often spans several smaller groups who are travelling together. The average size of this group is 7-8 people, but those people may be divided into smaller groups of 2 to 3 people who travel together. For example, IDs 163330, 268563, 513541, 651950, 725559, 825258, 879813, 1375106, and 2056236 all communicate with each other but do not all travel together. Rather, these IDs fall into only three groups that travel together while all still communicating. In Figure 5, these 9 IDs break into three groups, highlighted in grey in the k-means graph, which means they fall into three clusters of interests.
b. Again, this type of group has no preference in the park because its members travel all over.
c. At least 5 such groups come to the park on all three days, and there are between 10 and 15 such groups on each individual day.
d. The individuals in these groups sometimes have overlapping check-in times at certain attractions, which could mean they are meeting the others that they communicate with but do not travel with.
e. The people in this group are likely those who know each other and split up into smaller groups at the park based on preference, or who simply meet each other at the park, become friends, and continue to communicate that day.
f. This group does not necessarily have a need, but the park could enhance their experience by showing the locations of the friends that they communicate with most.
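As a rough illustration of how such groups could be derived, the sketch below links IDs that exchange messages into connected components and then splits each component by travel group. It is only a sketch under assumptions: the function names, the toy IDs, and the `travel_group_of` mapping are hypothetical and are not taken from our tool, and broadcast senders (such as park personnel) would need to be removed first, since they would otherwise merge everyone into one component.

```python
from collections import defaultdict

def communication_groups(messages):
    """Connected components of the 'who messages whom' graph (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for sender, receiver in messages:
        parent[find(sender)] = find(receiver)

    components = defaultdict(set)
    for node in list(parent):
        components[find(node)].add(node)
    return list(components.values())

def split_by_travel_group(component, travel_group_of):
    """Split one communication group into the sub-groups that travel together."""
    parts = defaultdict(set)
    for visitor in component:
        # Visitors without a known travel group form a group of their own.
        parts[travel_group_of.get(visitor, visitor)].add(visitor)
    return sorted(sorted(p) for p in parts.values())

# Hypothetical toy data: (sender, receiver) pairs and travel-group labels.
messages = [(1, 2), (2, 3), (4, 5), (3, 6)]
travel_group_of = {1: "A", 2: "A", 3: "B", 6: "B"}

groups = sorted(sorted(c) for c in communication_groups(messages))
subgroups = split_by_travel_group({1, 2, 3, 6}, travel_group_of)
```

In this toy example, IDs 1, 2, 3, and 6 form one communication group that splits into two travel groups, mirroring the pattern described for the 9 IDs above.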
4. People who travel together but do not communicate by messaging. We found many clusters of customers travelling together, but, based on the MC-2 data, we did not find that customers in the same cluster communicate with each other.
5. Park personnel: This group includes the park employees and security personnel. For example, we find that the person with ID 1278894 sends out large group messages throughout the day. We believe that this person is a park employee who sends information to visitors who are not currently checked in to any attraction. Furthermore, we hypothesize that the person with ID 839736 is a member of park security who disseminated information to other park personnel and/or the general public after the crime was discovered on Sunday.
MC1.2 – Are there notable differences in the patterns of activity in the park across the three days? Please describe the notable difference you see.
Limit your response to no more than 3 images and 300 words.
Figure 6: Heatmap based on check-in data of all visitors during Scott's show.
The most notable difference is the lack of check-ins at the pavilion (Building #32) and the performance stage (area #63) on Sunday after 12:00, as shown in boxes (a) and (b) of Figure 6. On Friday and Saturday, Scott Jones performed a show at the stage at 10:00 AM and 3:00 PM, which drew large check-in totals between 9:00 and 10:00 AM and between 2:00 and 3:00 PM. On Sunday, we see the same pattern for the 10:00 AM show; however, it is missing for the 3:00 PM show. The heat maps from our tool in Figure 6 illustrate the popularity of each check-in location during a specific time frame. On Friday and Saturday, the performance stage is clearly popular for check-ins between 2:00 and 3:00 PM, but Sunday’s heat map shows no heat signature, indicating that no one checked in at the stage at that time. The line graphs also show the reduced number of check-ins.
In conjunction with the lack of a performance on Sunday afternoon, there is also an absence of any check-ins at the pavilion after 12:00 PM. On the first two days of the weekend, the pavilion is one of the most popular attractions outside the times when Scott Jones is performing. In contrast, there are zero check-ins at the pavilion on Sunday afternoon and evening. Thus, unlike at the beginning of the weekend, the pavilion appears to have been closed for some time on Sunday.
Figure 7: Check-in numbers of all visitors from 14:00 to 15:00 for three days.
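The comparison behind these heat maps can be approximated with a simple count of check-ins per day and location within a time window. The following is a minimal sketch, assuming check-in records have already been reduced to (day, hour, location) tuples; the function names and the toy records are hypothetical and do not reproduce the actual data or our tool.

```python
from collections import Counter

def checkins_in_window(records, start_hour, end_hour):
    """Count check-ins per (day, location) with start_hour <= hour < end_hour."""
    counts = Counter()
    for day, hour, location in records:
        if start_hour <= hour < end_hour:
            counts[(day, location)] += 1
    return counts

def silent_locations(records, target_day, start_hour, end_hour):
    """Locations with no check-ins on target_day in the window,
    although every other day has at least one check-in there."""
    counts = checkins_in_window(records, start_hour, end_hour)
    days = {day for day, _, _ in records}
    locations = {loc for _, _, loc in records}
    return sorted(
        loc for loc in locations
        if counts[(target_day, loc)] == 0
        and all(counts[(d, loc)] > 0 for d in days if d != target_day)
    )

# Hypothetical toy records: (day, hour of check-in, location id).
records = [
    ("Fri", 14, 63), ("Fri", 14, 32), ("Fri", 14, 5),
    ("Sat", 14, 63), ("Sat", 16, 32), ("Sat", 14, 5),
    ("Sun", 9, 63), ("Sun", 10, 32), ("Sun", 14, 5),
]

quiet = silent_locations(records, "Sun", 12, 24)
```

With these toy records, locations 32 and 63 are reported as silent on Sunday after 12:00, which is the kind of gap that the heat maps make visible.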
Also, IDs 644885 and 521750 visit the park together every day and go to the performance stage for each show that Scott Jones performs. On Sunday, these two leave the park after the first show and do not return for the second, which is consistent with the second show having been cancelled on Sunday.
Figure 8: Trajectories of two IDs (644885, 521750) for three days.
MC1.3 – What anomalies or unusual patterns do you see? Describe no more than 10 anomalies, and prioritize those unusual patterns that you think are most likely to be relevant to the crime.
Limit your response to no more than 10 images and 500 words.
While the vast majority of visitors to the park participate in similar activities and generally behave as expected, there are several anomalies in which people break the norms of the park.
1. One of the largest breaks from the norm is the people who check in only at the entrance to the park. There are between 20 and 30 IDs throughout the weekend who check in at the park entrance and then never check in at any rides or attractions. For example, the IDs listed below are the customers who checked in to the park on Friday but did not check in at any other place.
Figure 9: Users who only check in at entrances on Friday.
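A filter of this kind is straightforward to express. The sketch below assumes check-ins are available as (visitor ID, location) pairs and that the set of entrance locations is known; the function name, the entrance labels, and the toy records are hypothetical.

```python
def entrance_only_visitors(checkins, entrance_ids):
    """Return the IDs whose every check-in is at a park entrance."""
    visited = {}
    for visitor_id, location in checkins:
        visited.setdefault(visitor_id, set()).add(location)
    # A visitor qualifies when the set of visited locations is a subset
    # of the entrance set.
    return sorted(v for v, locs in visited.items() if locs <= entrance_ids)

# Hypothetical toy data.
entrances = {"entrance_a", "entrance_b"}
checkins = [
    (101, "entrance_a"),
    (102, "entrance_a"), (102, "ride_7"),
    (103, "entrance_b"), (103, "entrance_b"),
]

only_entrance = entrance_only_visitors(checkins, entrances)
```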
2. There are a number of IDs with few check-ins who seem to visit the park’s attractions but do not check in and use them. In real-world terms, these people may be grandparents or supervisors who escort others to the rides or watch over them without any desire to participate. However, some people have few check-ins because they spend their time at park attractions that are not rides, such as the Beer Gardens, Restaurants, or Restrooms.
3. One person, ID 392618, shows anomalous activity. The person came to the park between 9 and 10 and acted normally until 2:55 PM at the show area, as shown in the heatmap below. The person then moved within the same area for 6 hours without any check-ins and suddenly jumped to building 37 around 8:43 PM, as shown in the trajectory view (right black box).
Figure 10: Abnormal movement analysis of User 392618.
4. The analysis of IDs with few check-ins also revealed two IDs (521750 and 644885) who followed a strange pattern. To start, they actually left the park during the day and then checked back in at the entrance when they returned in the afternoon. Very few IDs left the park and came back within the same day, but these two travelled together and left and returned at the same time. The two people walked around the park toward the performance stage (#63), waited outside for two hours without a check-in or any movement (from a half hour before Scott’s performance to a half hour after it), and then walked around the park and exited at 12:15 PM. They returned at 1:45 PM each day and repeated the same pattern of movement for the afternoon show, departing the park at 5:15 PM. Such behavior was not reproduced by any other IDs. The picture below shows the movement data for these two IDs on the three afternoons from 1:00 PM to 11:00 PM. They did not come back to the park on Sunday afternoon.
Figure 11: Trajectories of two IDs (644885, 521750) for three days from 13:00 to 23:00.
8. A few people check in at the pavilion more than 5 times. These are shown below:
Figure 12: Visitors who check in at pavilion more than 5 times.
10. User ID 1711922 was the person who spent the most time at the pavilion. This person does not check in at any attractions; the only check-in is at the exit.
Appendix
Clustering algorithm 1)
We use the k-means algorithm, a mature and fast algorithm, to cluster data points. In using k-means, we consider the time customers spent at the 42 attractions where customers check in, and aim to find groups of customers based on their attraction preferences. One example cluster consists of Mr. Scott’s fans, who tend to spend most of their time at Creighton Pavilion and the Stage to see the shows. The node-link graph, generated from the distances among nodes, presents the similarity among clusters.
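For illustration, a minimal k-means over time-fraction vectors might look like the sketch below. It is not the implementation used in our system: the function names, the normalization to fractions, the choice of k, the random initialization, and the toy data (3 attractions instead of 42) are all assumptions made for the example.

```python
import random

def time_fractions(seconds_per_attraction):
    """Normalize one visitor's time per attraction so the vector sums to 1."""
    total = sum(seconds_per_attraction)
    if total == 0:
        return [0.0] * len(seconds_per_attraction)
    return [s / total for s in seconds_per_attraction]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: seconds spent by four visitors at three attractions.
visitors = [[3600, 300, 100], [3000, 500, 200], [200, 4000, 300], [100, 3500, 400]]
points = [time_fractions(v) for v in visitors]
labels, centroids = kmeans(points, k=2)
```

In this toy example, the first two visitors (who spend most of their time at the first attraction) end up in one cluster and the last two in the other. Distances between the returned centroids could then be used to lay out a node-link view of cluster similarity.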
Clustering algorithm 2)
We implement sequence-based clustering to group people based on their check-in sequences over categories of attractions. In this approach, we first find the longest common subsequence (LCS) to measure the similarity between customers’ sequences. Then, we apply a density-based clustering algorithm, DBSCAN, to group customers.
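The two steps can be sketched as follows. This is a simplified illustration rather than our implementation: the normalization of the LCS length into a distance, the `eps` and `min_pts` values, and the toy category sequences are assumptions chosen for the example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_distance(a, b):
    """Assumed normalization: 1 - LCS / longer length (0 for identical sequences)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return 1.0 - lcs_length(a, b) / longest

def dbscan(items, distance, eps, min_pts):
    """Plain DBSCAN; returns one label per item, with -1 marking noise."""
    n = len(items)
    neighbors = [
        [j for j in range(n) if distance(items[i], items[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # only core points expand the cluster
    return labels

# Hypothetical toy category sequences for six visitors.
sequences = [
    ["thrill", "food", "thrill", "show"],
    ["thrill", "food", "thrill", "shopping"],
    ["thrill", "thrill", "food", "show"],
    ["kiddie", "food", "kiddie", "restroom"],
    ["kiddie", "kiddie", "food", "restroom"],
    ["information"],
]
seq_labels = dbscan(sequences, lcs_distance, eps=0.3, min_pts=2)
```

With these toy sequences, the three thrill-oriented visitors form one cluster, the two kiddie-oriented visitors form another, and the visitor with a single information check-in is left as noise.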
Clustering algorithm 3)
We also implement a check-in-based clustering approach in which people are clustered when they share the same check-in sequence among attractions.
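One simple way to realize this idea is to use each visitor's full check-in sequence as a grouping key. The sketch below assumes that matching means identical (timestamp, attraction) pairs in the same order; the exact matching rule, the function name, and the toy records are hypothetical.

```python
from collections import defaultdict

def travel_groups(checkins):
    """Group visitors whose ordered (timestamp, attraction) sequences are identical.

    checkins: iterable of (visitor_id, timestamp, attraction_id) records.
    Returns a sorted list of groups (each a sorted list of IDs) with 2+ members.
    """
    per_visitor = defaultdict(list)
    for visitor_id, timestamp, attraction_id in sorted(
        checkins, key=lambda r: (r[0], r[1])
    ):
        per_visitor[visitor_id].append((timestamp, attraction_id))

    groups = defaultdict(list)
    for visitor_id, sequence in per_visitor.items():
        groups[tuple(sequence)].append(visitor_id)
    return sorted(sorted(members) for members in groups.values() if len(members) > 1)

# Hypothetical toy records.
toy_checkins = [
    (1, "09:00", 32), (2, "09:00", 32), (1, "09:40", 63), (2, "09:40", 63),
    (3, "09:00", 32), (3, "10:10", 64),
    (4, "11:00", 5), (5, "11:00", 5),
]
together = travel_groups(toy_checkins)
```

Here visitors 1 and 2 and visitors 4 and 5 are reported as travelling together, while visitor 3 shares only the first check-in with others and is left ungrouped. A tolerance on timestamps could be added if exact equality proves too strict.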
Comparison between Clustering 2 and Clustering 3: Clustering 3 aims to group customers who travel together by using check-in data, while Clustering 2 groups customers by their preference for attraction categories (e.g., thrill rides), extracted from the sequences of their visits to those categories.